1. Items of Progress

During  the  1977  contract  year  our  work  for  the  Air  Force 
Office  of  Scientific  Research  has  accomplished  the  following 
goals:

1.  Acquisition  of  a carefully  labeled  and  segmented  data 
base  of  connected  speech  for  the  testing  of  segmentation 
and  phonetic  identification  algorithms. 

2. Development and testing against the data base of a
segmentation algorithm based on rate of spectral change
and rms energy.

3. Initiation of a study of inter-speaker vowel-formant
scaling based on a 2-dimensional constraint on a speaker's
first three formant frequencies.

2. Description of Progress
2.1.  Connected  Speech  Data  Base 

2.1.1. Motivation. One of the main bottlenecks for
developing effective procedures for the processing and
recognition of continuous speech, especially unconstrained
conversational speech, is the acquisition of
carefully labeled speech data to test algorithms for
phonetic analysis. A part of the past year's effort
has therefore been directed toward building up such a
data base. This data base should be a valuable resource
for the testing of any future algorithms for analyzing
continuous speech.


To date, about 15 seconds of continuous speech has
been analyzed according to the following procedures.

2.1.2. Marking Procedures. Using the Interactive
Laboratory System (ILS) developed under previous support
from AFOSR, a segment of speech of 1-3 minutes
duration is stored on disk as a waveform bandlimited
to 5 kHz and sampled at 10 kHz. The displayed waveform,
formant frequencies, and rms energy are used
together with repeated audio playback to assist the
operator in making the best possible judgment for
segmentation and transcription of each 100-frame interval
(640 ms) of the waveform. Hard copies of the waveform
and parameter display are preserved with their
segmentation markers and phonetic transcriptions. At the
same time, a label file is prepared which contains a
greatly simplified form of the transcription. Each
segment is marked as a V, S, N, C, or Z depending on
its classification, respectively, as a vowel (V),
sonorant consonant (w, l, r, j) (S), nasal consonant
(N), other consonant (C), or non-speech (silence or
other non-speech) (Z). Ultimately, it would be desirable
to encode the full phonetic transcription in the
label file so that all the available phonetic information
could be accessible to label-referenced algorithms.
The labor involved with the present encoding system
makes this not cost-effective at present.


2.1.3. Control on Subjectivity. Because even the best
human transcriptions of connected speech contain some
uncontrolled subjective factor, the above process is
repeated on each speech segment by two transcribers
working independently. After a transcription is completed
by both workers, disagreements between the
transcriptions are noted and, by working together,
the transcribers then resolve most of the disagreements
by discussion and re-examination of the data.
Usually, agreement is easily achieved. There is, however,
always some residual disagreement or uncertainty
in difficult parts of the transcription. These are left
as points of ambiguity in the final transcription.

2.1.4. Consistency. A comparison of the two transcriptions
for the same 15-second interval showed that one
experimenter transcribed and labelled 125 segments, while the
other experimenter transcribed and labelled 135 segments.
While the number of segments differed by 10, the number
of discrepancies between the two transcriptions was 19.

It is found that the two transcribers are fairly
consistent with each other in their placement of segment
boundaries. A one- or two-frame discrepancy is
not uncommon, and the average placement is consistent
to within about 10 ms. Specifically, 33 percent of the
boundaries are in perfect agreement, 67 percent of the
boundaries are within 6.4 ms or less, and 83 percent
of the boundaries are within 12.8 ms or less.


These  figures,  then,  provide  a useful  guideline 
for  evaluating  automatic  segmentation  algorithms: 
their  agreement  with  human  transcription  need  not  be 
better  than  the  agreement  of  the  humans  with  each 
other.  Note  that  the  figure  of  about  10  ms  is  of 
the  same  order  as  a single  pitch  period  of  a male 
voice;  it  might  therefore  be  taken  to  represent  a 
measure  of  inherent  uncertainty  of  event  timing  in 
speech. 
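
The agreement percentages quoted above can be tallied with a few lines of code. The following is only a minimal sketch: it assumes the two sets of boundaries have already been paired one-to-one (the report does not say how pairing was done), that positions are in frames, and that one frame is 6.4 ms (640 ms per 100 frames). The function name and the example boundary positions are invented for illustration and are not the report's data.

```python
FRAME_MS = 6.4  # one analysis frame: 640 ms / 100 frames


def agreement_within(bounds_a, bounds_b, tolerances=(0, 1, 2)):
    """Percent of paired boundaries that differ by at most `tol` frames.

    bounds_a, bounds_b: equal-length lists of boundary positions (frames),
    assumed to be already paired one-to-one (an assumption of this sketch).
    """
    if len(bounds_a) != len(bounds_b) or not bounds_a:
        raise ValueError("boundary lists must be non-empty and equal in length")
    diffs = [abs(a - b) for a, b in zip(bounds_a, bounds_b)]
    return {tol: 100.0 * sum(d <= tol for d in diffs) / len(diffs)
            for tol in tolerances}


# Invented example data (NOT taken from the report).
transcriber_1 = [10, 25, 40, 58, 71, 90]
transcriber_2 = [10, 26, 40, 60, 72, 93]

for tol, pct in agreement_within(transcriber_1, transcriber_2).items():
    print(f"within {tol * FRAME_MS:4.1f} ms: {pct:.0f} percent")
```

The same tally applies unchanged when one of the two lists comes from an automatic algorithm rather than a second transcriber.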

2.2.  Segmentation  Algorithm 

2.2.1. Description. In order to build up a substantial
data base for meaningful studies of conversational
speech data, it is expected that automatic algorithms
will be necessary for segmenting and labeling speech
events. One such algorithm has been devised for the
automatic detection of segment boundaries. This is
accomplished by computing the spectral variance of the
speech signal as a function of time. The variance
function tends to have local maxima in the transition
region between sounds and local minima in sounds which
can have steady-state characteristics. Thus a potential
segment boundary is placed at the location of each peak
in the variance function. All potential boundary
markers are displayed on the graphics terminal for
visual verification, but the operator must still
identify and label the marked segments. This boundary
detection algorithm is speaker-independent and
operates reliably on unconstrained speech.
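
The report does not give the exact definition of the spectral variance function, so the following is only a sketch of the general idea under stated assumptions. It assumes each frame has already been reduced to a short spectrum (a list of channel levels), and it takes the function to be the variance of each channel over a few neighbouring frames, summed over channels. The names `spectral_variance`, `pick_peaks`, `half_width`, and `threshold` are hypothetical.

```python
from statistics import pvariance


def spectral_variance(frames, half_width=2):
    """Summed per-channel variance over a sliding window of frames.

    frames: list of per-frame spectra (equal-length lists of floats).
    half_width: frames on each side of the centre frame (assumed value).
    """
    n_frames = len(frames)
    n_channels = len(frames[0])
    values = []
    for t in range(n_frames):
        window = frames[max(0, t - half_width):min(n_frames, t + half_width + 1)]
        values.append(sum(pvariance([f[c] for f in window])
                          for c in range(n_channels)))
    return values


def pick_peaks(values, threshold=0.0):
    """Indices of local maxima above threshold (potential boundaries)."""
    return [t for t in range(1, len(values) - 1)
            if values[t] > threshold
            and values[t] >= values[t - 1]
            and values[t] > values[t + 1]]


# Synthetic example: ten frames of one steady sound, then ten of another.
frames = [[1.0, 0.0]] * 10 + [[0.0, 1.0]] * 10
print(pick_peaks(spectral_variance(frames)))
```

In the steady portions the function is zero, and it rises only in the frames whose window straddles the change, which is the behaviour described above: maxima in transitions, minima in steady-state sounds.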

2.2.2. Human vs. Machine Marking. The performance
of the automatic algorithm was evaluated at one level
by comparing the location of vowel boundaries placed
by the machine versus those placed by one of the
transcribers. It was found that there was complete
agreement on 33 percent of the initial vowel
boundaries. Furthermore, 66 percent of the initial vowel
boundaries and 52 percent of the final vowel boundaries
were within 6.4 ms of each other. Also, 81 percent
of the initial vowel boundaries and 81 percent of the
final vowel boundaries were within 12.8 ms of each
other. Note that this is very close to the level of
agreement between the two transcribers.

These  results  are  very  encouraging  and  demonstrate 
the  effectiveness  of  the  spectral  variance  function  in 
locating  vowel  boundaries.  It  is  expected  that  this 
algorithm could be the foundation for a totally
automatic segmentation and labeling process.




2.3. Vowel Scaling Study

2.3.1. Background. As reported by Broad and Wakita
(1977), a large sampling of a given speaker's first
three vowel formant frequencies clusters near a 2-part
2-dimensional surface of the form:

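$$\alpha_1 F_1 + \alpha_2 F_2 + \alpha_3 F_3 + \alpha_4 = 0 \quad \text{for [condition illegible in source]},$$
$$\beta_1 F_1 + \beta_2 F_2 + \beta_3 F_3 + \beta_4 = 0 \quad \text{otherwise.} \qquad (1)$$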

They noted that this constraint on the formant
frequencies had important implications for the problem of
inter-speaker formant scaling. In particular, the
hypothesis of uniform scaling would imply that any
speaker's distribution of vowel formant frequencies
should have the form (1) modified only by replacing
$\alpha_4$ and $\beta_4$ with $\alpha_4/k$ and $\beta_4/k$,
where k is a speaker-dependent scaling factor related
to the average vocal tract length.
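
The step from uniform scaling to a change in the offsets alone can be written out briefly. This is a sketch under one assumption not stated in the report: that uniform scaling means each formant of the new speaker is the reference speaker's formant divided by k, that is, $F_i' = F_i / k$.

```latex
\begin{align*}
\alpha_1 F_1 + \alpha_2 F_2 + \alpha_3 F_3 + \alpha_4 &= 0,
  \qquad F_i = k F_i' \\
k\,(\alpha_1 F_1' + \alpha_2 F_2' + \alpha_3 F_3') + \alpha_4 &= 0 \\
\alpha_1 F_1' + \alpha_2 F_2' + \alpha_3 F_3' + \frac{\alpha_4}{k} &= 0
\end{align*}
```

The same steps apply to the second plane with $\beta$ in place of $\alpha$. Under this assumption the first three coefficients of each plane are unchanged and only the offsets are divided by k.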

2.3.2. New Results. To study this aspect of inter-speaker
scaling, we have been collecting data on the
vowels of 4 new speakers. The study is to include
2-6 additional speakers beyond these. The data
collected so far on the 4 speakers, as reduced to the
form (1), are shown in Table I together with those of
the speaker studied by Broad and Wakita.

Table I. Parameters of the representation (1) for four speakers,
compared to those of the speaker (F2) studied by Broad and Wakita
(1977). N is the number of samples used for each speaker, and $\sigma$
is the rms distance of the samples from the representation (1).

These preliminary results suggest the following tentative
conclusions:

(1) The two-plane form (1) provides a satisfactory
description of vowel formant frequencies
across speakers, inasmuch as the
rms spread of the data about these planes
is about the same for all the speakers.

(2) The various speakers' representations are
similar at least to the extent that all
their direction cosines ($\alpha_1$-$\alpha_3$, $\beta_1$-$\beta_3$)
have the same respective signs, as do the
average offsets in Hz ($\alpha_4$ and $\beta_4$).

(3) The $\alpha$'s and $\beta$'s are quite variable from
speaker to speaker, suggesting that the
hypothesis of uniform scaling cannot be
supported. This will be determined
conclusively from a more formal analysis of
the complete data set.

Whether  some  modified  form  of  uniform  scaling  can 
be  made  to  work  remains  to  be  seen. 
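
The rms distance of a set of formant samples from one plane of the form (1) can be sketched as follows. This is only an illustration under assumptions: the first three coefficients are taken to be direction cosines (a unit normal vector), so the signed distance in Hz is simply the left-hand side of the plane equation, and a single plane is used because the condition that selects between the two planes is not available. The function name and the example numbers are invented, not the report's data.

```python
import math


def rms_plane_distance(samples, cosines, offset):
    """RMS distance (Hz) of (F1, F2, F3) samples from one plane.

    cosines: the three direction cosines of the plane normal (unit length).
    offset:  the fourth coefficient of the plane equation, in Hz.
    """
    norm = math.sqrt(sum(c * c for c in cosines))
    if abs(norm - 1.0) > 1e-6:
        raise ValueError("direction cosines must form a unit vector")
    squared = [(sum(c * f for c, f in zip(cosines, s)) + offset) ** 2
               for s in samples]
    return math.sqrt(sum(squared) / len(squared))


# Invented example: the plane F1 = 500 Hz and two samples 10 Hz either side.
samples = [(510.0, 1500.0, 2500.0), (490.0, 1400.0, 2600.0)]
print(rms_plane_distance(samples, (1.0, 0.0, 0.0), -500.0))
```

Comparing this spread across speakers is the kind of check that conclusion (1) above rests on.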

3. Publications


A revision of the following paper has been re-submitted
to the IEEE Transactions on Acoustics, Speech, and Signal
Processing:

L.  L.  Pfeifer,  An  Interactive  Laboratory  System  for 
Research  in  Speech  and  Signal  Processing. 


The  following  manuscript  is  near  completion  and  will 
be  submitted  to  the  same  journal: 

L.  L.  Pfeifer,  Methodologies  for  Acoustic  Studies  of 
Vowels  in  Conversational  Speech. 